AlkB and CYP153 are important alkane hydroxylases responsible for aerobic alkane degradation in
bioremediation of oil-polluted environments and microbial enhanced oil recovery. Since their distribution
in nature is not clear, we surveyed the 3,979 microbial genomes sequenced thus far and
137 metagenomes from terrestrial, freshwater, and marine environments. Hundreds of diverse alkB and
CYP153 genes, including many novel ones, were found in bacterial genomes, whereas none were found in
archaeal genomes. Moreover, these genes were detected with different distributional patterns in the
terrestrial, freshwater, and marine metagenomes. Hints of horizontal gene transfer, gene duplication, and
gene fusion were found, which together are likely responsible for diversifying the alkB and CYP153 genes
so that they adapt to the ubiquitous distribution of different alkanes in nature. In addition, the different
distributions of these genes between bacterial genomes and metagenomes suggest potentially important
roles of unknown or less common alkane degraders in nature.

Bacterial alkane degradation is important for the bioremediation of petroleum-contaminated environments
as well as for microbial enhanced oil recovery. Alkane hydroxylases (AHs) are the key enzymes in aerobic
degradation of alkanes by bacteria. These enzymes hydroxylate alkanes to alcohols, which are further
oxidized to fatty acids and catabolized via the bacterial β-oxidation pathway.

The integral-membrane alkane monooxygenase (AlkB)-related AHs are so far the most commonly found AHs,
distributed in both Gram-negative and Gram-positive bacteria. Rubredoxin and rubredoxin reductase are the
essential electron transfer components needed for alkane hydroxylation by AlkB. The substrates of AlkB are
generally n-alkanes ranging from C10 to C16. However, in some Actinomycetes, AlkB-type AHs could
hydroxylate n-alkanes with chain lengths up to C32 when they were fused with a rubredoxin protein. The
cytochrome P450 CYP153 family is another type of AH for degradation of short- and medium-chain-length
n-alkanes, and is commonly found in alkane-degrading bacteria lacking AlkB. A number of bacteria have
multiple AHs, which have been shown to potentially expand the n-alkane range of the host strain. For example,
the co-existence of AlkB and CYP153 was found in Dietzia sp. DQ12-45-1b, and the co-existence of multiple
AHs was found in Amycolicicoccus subflavus DQS3-9A1. Some Rhodococcus strains were found to contain more
than one alkB homologous gene, with different substrate ranges and induction patterns.

It was reported that more than 60 genera of aerobic bacteria and 5 genera of anaerobic bacteria are able to
degrade n-alkanes. Although AHs have been intensively studied since the first enzyme was identified in 1977,
only some alkane-degrading bacteria have been subjected to AH analysis. Recently, AH genes have been
discovered in new bacterial strains, indicating the presence of unknown n-alkane-degrading bacteria as well as
unknown AHs in nature that could be important for natural n-alkane metabolism as well as for industrial
applications such as bioremediation and microbial enhanced oil recovery. However, what are these potential
n-alkane-degrading bacteria? How different are their AH systems? How are they distributed in nature? Why do
some of them have multiple AH genes, whereas others do not?

To answer these questions, we searched for the genes coding for AlkB and CYP153 proteins in the microbial 
genome and metagenome data deposited in GenBank and the Integrated Microbial Genomes (IMG) system. We 
then evaluated the diversity of AHs among the microorganisms and different environments, as well as the possible 
origins of multiple AH genes in one strain, such as gene duplication and gene transfer. 

Results 

Features of alkB and CYP153 genes in microbial genomes. Distribution in genomes. In the total of 2,069 complete
and 1,910 draft microbial genomes, 458 genes for AlkB and 130 genes for CYP153 were found in 369 and 87
genomes, respectively. Neither the alkB nor the CYP153 gene was found in archaeal genomes. At the phylum
level, the alkB genes were only found in Proteobacteria, Actinobacteria, Bacteroidetes, and Spirochaetes.
Among them, 278 and 162 alkB genes were found in 220 Proteobacteria and 130 Actinobacteria genomes
(Table 1), which belonged to at least 51 and 23 genera, respectively (Fig. S1). CYP153 genes were found only
in Proteobacteria, Actinobacteria and Planctomycetes. About 98 and 31 CYP153 genes were found in 59
Proteobacteria and 26 Actinobacteria genomes (Table 1), which belonged to at least 20 and 8 genera,
respectively (Fig. S2). Of the sequenced genomes, relatively more Actinobacteria contained alkB or CYP153
genes (30.7% and 6.1%, respectively) than Proteobacteria did (13.3% and 3.6%, respectively).

Table 1 | Distribution of alkB and CYP153 genes in microbial genomes

| Phylum (Class) | No. of genomes sequenced | No. of genomes containing alkB | No. of alkB genes found | No. of genomes containing CYP153 | No. of CYP153 genes found |
|---|---|---|---|---|---|
| Actinobacteria | 424 | 130 | 162 | 26 | 31 |
| Bacteroidetes | 220 | [illegible] | [illegible] | ND | ND |
| Alphaproteobacteria* | 348 | 55 | 76 | 38 | 63 |
| Betaproteobacteria* | 230 | 70 | 75 | 3 | 3 |
| Gammaproteobacteria* | 835 | 93 | 125 | 18 | 32 |
| Deltaproteobacteria* | 126 | 2 | 2 | ND | ND |
| Spirochaetes | 73 | 4 | 4 | ND | ND |
| Planctomycetes | 11 | ND | ND | 1 | 1 |

ND, not detected.
*: Classes are provided for the phylum Proteobacteria.
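The prevalence figures quoted for Actinobacteria can be checked with simple arithmetic. The snippet below is a minimal sketch, not the authors' procedure: it assumes prevalence is defined as genomes carrying the gene divided by genomes sequenced, and the function name `prevalence_percent` is introduced here only for illustration.

```python
# Counts taken from the text and Table 1 for Actinobacteria.
ACTINOBACTERIA_SEQUENCED = 424
ACTINOBACTERIA_WITH_ALKB = 130
ACTINOBACTERIA_WITH_CYP153 = 26


def prevalence_percent(n_with_gene: int, n_sequenced: int) -> float:
    """Percentage of sequenced genomes carrying at least one copy of the gene."""
    if n_sequenced <= 0:
        raise ValueError("n_sequenced must be positive")
    return 100.0 * n_with_gene / n_sequenced


alkb_pct = prevalence_percent(ACTINOBACTERIA_WITH_ALKB, ACTINOBACTERIA_SEQUENCED)
cyp153_pct = prevalence_percent(ACTINOBACTERIA_WITH_CYP153, ACTINOBACTERIA_SEQUENCED)

# Rounded to one decimal place, as in the text.
print(f"alkB: {alkb_pct:.1f}%  CYP153: {cyp153_pct:.1f}%")
```

Under this definition the two values round to the 30.7% and 6.1% reported for Actinobacteria.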

At the genus level, alkB and CYP153 genes were detected in 85 and 30 genera, respectively (Figs. S1 and S2).
Among the 27 genera that had at least four sequenced genomes, alkB genes were detected in all or most of
the sequenced genomes belonging to Frankia, Gordonia, Mycobacterium, Rhodococcus, Rhodobacter,
Burkholderia, Acinetobacter, Legionella and Marinobacter, suggesting that alkB genes could be core genes
shared by these genera (Fig. S3). Similarly, CYP153 genes were found in all the sequenced genomes of
Bradyrhizobium, Rhodopseudomonas, Caulobacter and Novosphingobium, also indicating their potential core
roles in these genera (Fig. S4). In contrast, only one, one, and four genomes were found to have alkB genes
among the 10 Corynebacterium, 40 Streptomyces and 60 Enterobacter sequenced genomes, respectively.
Moreover, only one genome was found to contain the CYP153 gene in each of the 34 Acinetobacter and 76
Burkholderia genomes sequenced.

Architectures of AlkB and CYP153. Most of the deduced proteins of the alkB genes detected had only an AH
domain belonging to the membrane-FADS-like superfamily and required rubredoxin and rubredoxin reductase
genes encoded in separate open reading frames in order to function catalytically. However, 23 genes
encoding multiple-domain AlkB proteins were found in this work (Fig. 1). Nineteen of them encoded
AlkB-rubredoxin fused proteins, having two conserved domains comprising an N-terminal alkane-hydroxylase
domain and a C-terminal rubredoxin domain (Table S1). Seventeen of the 19 genes encoding AlkB-rubredoxin
fused proteins were found in Actinobacteria such as Streptomyces, Aeromicrobium, Gordonia, Amycolatopsis,
Janibacter, Pseudonocardia, Saccharomonospora and Dietzia. The remaining four genes encoding
multiple-domain AlkB proteins were found in Leptospira, Limnobacter and Polaromonas, comprising an
N-terminal ferredoxin domain, a ferredoxin reductase domain and a C-terminal AH domain (Table S1), an
architecture that has not been reported before to our knowledge.

Three CYP153 genes encoding similar fusion proteins were also found in Gordonia araii NBRC 100433,
G. polyisoprenivorans NBRC 16320 and G. polyisoprenivorans VH2, each consisting of an N-terminal
cytochrome P450 domain, a ferredoxin reductase domain, and a C-terminal ferredoxin domain.



Presence of multiple alkane hydroxylase genes. Within the 369 alkB-containing and 87 CYP153-containing
genomes, 73 and 32 genomes, respectively, were found to have multiple copies of alkB and CYP153
homologous genes (Tables S2 and S3). For example, up to six alkB homologous genes and five CYP153
homologous genes were found in Rhodococcus erythropolis SK121 and Parvibaculum lavamentivorans DS-1,
respectively. These multiple copies of genes within one genome shared sequence similarities ranging from
27.7% to 99.7% for alkB genes and 42.4% to 100% for CYP153 genes.
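How such pairwise values can be obtained is not described in this section. As a minimal sketch, assuming the two sequences have already been aligned and that identity is counted over gap-free columns only (one common convention, not necessarily the one used here), the calculation looks like this. The toy sequences and the function name `percent_identity` are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments (not real AlkB or CYP153 sequences).
toy_identity = percent_identity("MKT-LLV", "MKTALIV")
print(f"{toy_identity:.1f}% identity")
```

Other conventions (for example, dividing by the full alignment length or by the shorter sequence) give different percentages, so reported values are only comparable when the same definition is used.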

Furthermore, both the alkB and CYP153 genes were simultaneously detected in 38 genomes that belonged to
16 genera, including Marinobacter, Alcanivorax, Rhodococcus and Mycobacterium (Table S4). However, the
distribution of these two genes was uneven: 10.3% of the total alkB-containing genomes harbored CYP153
genes, in contrast to the 43.7% of the CYP153-containing genomes that harbored alkB genes (Fig. S5a). These
unbalanced distributions differed among bacterial phyla (Fig. S5b-e). For example, 24 Actinobacteria
genomes had both the alkB and CYP153 genes, accounting for 18.5% and 88.9% of the total Actinobacteria
genomes containing alkB and CYP153 genes, respectively.
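The asymmetry between the two percentages follows directly from the different sizes of the two genome sets. The sketch below illustrates this with set operations; the integer genome identifiers are placeholders chosen only so that the set sizes (369 and 87) and the overlap (38) match the counts in the text, and `cooccurrence` is a name introduced here.

```python
def cooccurrence(alkb_genomes: set, cyp153_genomes: set) -> dict:
    """Count genomes carrying both genes and express that as a share of each set."""
    if not alkb_genomes or not cyp153_genomes:
        raise ValueError("both genome sets must be non-empty")
    both = alkb_genomes & cyp153_genomes
    return {
        "both": len(both),
        "pct_of_alkb": 100.0 * len(both) / len(alkb_genomes),
        "pct_of_cyp153": 100.0 * len(both) / len(cyp153_genomes),
    }


# Placeholder identifiers: 369 alkB genomes, 87 CYP153 genomes, 38 shared.
alkb_ids = set(range(0, 369))
cyp153_ids = set(range(331, 418))

summary = cooccurrence(alkb_ids, cyp153_ids)
print(summary["both"], round(summary["pct_of_alkb"], 1), round(summary["pct_of_cyp153"], 1))
```

With real data, the two input sets would be the genome accessions in which each gene was detected.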

Figure 1 | Conserved domain architecture of AlkB and CYP153. AlkB, AlkB-like alkane hydroxylase; Rd,
rubredoxin; Fer, ferredoxin; FNR, ferredoxin reductase; CypX, cytochrome P450.

Phylogenies of alkB and CYP153 genes in microbial genomes. The phylogenetic and comparative genomic
analyses revealed that the sequences of AlkB AHs could be grouped into eight clusters (I-VIII) (Fig. 2a).
Cluster I included all the sequences from Actinobacteria and one sequence from the gammaproteobacterium
Salinisphaera shabanensis E1L3A. Cluster II included most of the sequences from Betaproteobacteria and a
few sequences from Gammaproteobacteria such as Marinobacter, Pseudomonas, and Alcanivorax. Sequences
of Cluster III were all from Gammaproteobacteria, including all the sequences from Acinetobacter and
sequences from some Marinobacter and Psychrobacter species. Sequences in Cluster IV were more diverse,
coming from Betaproteobacteria, Gammaproteobacteria, and Alphaproteobacteria. Most of the sequences in
Cluster V were from Bacteroidetes, together with a few sequences from Gammaproteobacteria and
Spirochaetes. Sequences in Cluster VI were mainly from Gammaproteobacteria, including those from
Pseudomonas aeruginosa, Alcanivorax and Marinobacter. Sequences in Cluster VII were all from
Alphaproteobacteria and clustered into two groups (Fig. 2a and Fig. S6). Cluster VIII was distinct from the
other clusters and included mixed sequences from Alphaproteobacteria, Betaproteobacteria,
Gammaproteobacteria, and Spirochaetes. Moreover, sequences in Cluster VIII showed low similarities with
sequences in other clusters. For example, two putative alkB genes in Burkholderia sp. Ch1-1, one of which
(ZP_10031959.1) was embedded in Cluster II and the other (ZP_10037453.1) in Cluster VIII, shared a very low
sequence similarity (27.7%) with each other. The functions of the sequences in Cluster VIII are not clear,
since none of the gene sequences in this cluster has been functionally characterised by experiments such as
heterologous expression or gene knockout analysis. However, their deduced proteins contained all the
residues that are conserved in alkane hydroxylases.

The topology of the phylogenetic tree based on the AlkB sequences largely matched, but was not identical
to, that based on the 16S rRNA genes. For example, the Proteobacteria AlkB sequences formed a number of
distinct clusters, and AlkB sequences from Gammaproteobacteria were distributed among five different
clusters (Fig. 2b).

The CYP153 sequences could be grouped into four major clusters (I-IV) (Fig. 3a). Cluster I included two
groups, with sequences from Actinobacteria and from Gammaproteobacteria such as Alcanivorax,
Acinetobacter, and Marinobacter, respectively. Cluster II included most of the remaining
Gammaproteobacterial sequences and a few sequences from Alphaproteobacteria. Sequences in Clusters III
and IV were mainly from Alphaproteobacteria such as Bradyrhizobium, Rhodopseudomonas, and Caulobacter,
mixed with a few sequences from Mycobacterium (Actinobacteria), Polaromonas (Betaproteobacteria),
Dickeya (Gammaproteobacteria), Burkholderia (Betaproteobacteria), and Planctomyces (Planctomycetes).
Again, the topology of the CYP153-based phylogenetic tree was different from that of the 16S rRNA
gene-based phylogenetic tree. For example, within the Alphaproteobacteria, sequences belonging to the
CYP153A group were generally in Cluster III, whereas sequences belonging to the CYP153C and CYP153D
groups were in Cluster IV in the CYP153 phylogenetic tree (Fig. 3b).

AH genes can be transferred between organisms. For example, the alkB operon was found on the OCT plasmid
in Pseudomonas putida, which can readily be transferred. In P. mendocina ymp, a putative alkB gene
(YP_001185946.1) was located in a predicted genomic island (GI) and had 100% sequence identity with the
alkB gene (CAB54050.1) on the P. putida OCT plasmid. In addition, YP_001185946.1 had a much lower G+C
content (45.9%) than its host genome (64.7%), also indicating a possible HGT event. Similar HGT could also be
found for CYP153 genes. The clustering in the phylogenetic tree suggests that CYP153 genes from
Alcanivorax, Marinobacter, Acinetobacter, and Actinobacteria potentially share the same ancient ancestor
(Fig. 3b). Of all the 26 CYP153 gene-containing Actinobacteria genomes (both complete and draft), 18 had
over 10% difference in G+C content between the CYP153 genes and their host genomes (Table 2).
Furthermore, of all the 15 complete Actinobacteria genomes, predicted GIs containing CYP153 genes were
found in eight genomes, and plasmids containing CYP153 genes were found in three genomes. Together,
these results indicate that HGT events are potentially common for CYP153 genes in Actinobacteria.
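The G+C comparison used as a hint for HGT can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: G+C content is computed over unambiguous bases only, and the 10-point threshold in `is_gc_outlier` treats "over 10% difference" as a difference in percentage points, which is an interpretation the text does not confirm. All function names are introduced here.

```python
def gc_percent(seq: str) -> float:
    """G+C content (%) of a nucleotide sequence, ignoring ambiguous bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc_difference(gene_gc: float, genome_gc: float) -> float:
    """Signed difference in percentage points (gene minus host genome)."""
    return gene_gc - genome_gc


def is_gc_outlier(gene_gc: float, genome_gc: float, min_points: float = 10.0) -> bool:
    """Flag a gene whose G+C content deviates from its genome (assumed threshold)."""
    return abs(gc_difference(gene_gc, genome_gc)) > min_points


# Values quoted in the text for YP_001185946.1 and the P. mendocina ymp genome.
alkb_delta = gc_difference(45.9, 64.7)
print(f"difference: {alkb_delta:.1f} points, outlier: {is_gc_outlier(45.9, 64.7)}")
```

A deviating G+C content is only circumstantial evidence; the text combines it with genomic-island prediction, plasmid location and phylogenetic placement.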

Distribution of alkB and CYP153 genes in metagenomes. Although the sampling sites, metagenomic
sequencing and analysis strategies clearly differed among samples, the average numbers of AH genes in
the metagenomes from terrestrial, freshwater and marine samples could more or less reflect their
distribution in the three habitats.

From the 42 terrestrial, 35 freshwater, and 60 marine metagenomes available thus far, the average microbial
compositions were calculated (Fig. S7). The results indicated that the terrestrial microbial community was
dominated by Proteobacteria (34.9%), Actinobacteria (19.5%), and Cyanobacteria (22%). Proteobacteria
(56.9%), Cyanobacteria (12.5%), Bacteroidetes (14.2%) and Actinobacteria (9.6%) were abundant in
freshwater environments. In marine metagenomes, bacteria were mainly from Proteobacteria (51.6%),
Bacteroidetes (13.4%) and Cyanobacteria (12.1%).

alkB genes in metagenomes. A total of 301, 144, and 524 alkB genes were found among the 20,162,506,
4,999,959, and 9,254,226 total proteins predicted in the terrestrial, freshwater and marine metagenomic
datasets, respectively. Phylogenetic trees based on these alkB gene sequences were constructed together
with sequences from the genomes and the 28 reference sequences (Table S6). The results indicated that the
terrestrial AlkB sequences mainly clustered with sequences derived from Actinobacteria,
Gammaproteobacteria and Alphaproteobacteria genomes (Fig. S8a), corresponding to Clusters I, VI and VII of
the genome AlkB phylogenetic tree, respectively, with some sequences clustering with those from
Bacteroidetes in Cluster V (Fig. 2a). The freshwater AlkB sequences mainly clustered with those from
Gammaproteobacteria and Bacteroidetes genomes (Clusters III, IV, V and VI) (Fig. S8b), with some sequences
distributed in Cluster VIII. Only two freshwater sequences were related to alkB sequences from
Actinobacteria in Cluster I. The marine AlkB sequences were mainly distributed in Clusters III, IV, V, and VIII,
being closely related to sequences from Gammaproteobacteria and Bacteroidetes. No sequence was found in
Cluster I with Actinobacteria, but at least 22 other sequences formed a new cluster distant from any of the
seven main clusters (Fig. S8c). In general, the phylogeny of the alkB gene sequences revealed that only a few
alkB sequences retrieved from the three metagenomic databases were closely related (>75% amino acid
identity) to the sequences found in microbial genomes or previously identified, suggesting the presence of
numerous novel alkB genes in the different environments, especially in the marine environment.
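Because the three datasets differ greatly in size, the raw alkB counts are easier to compare after normalisation. The sketch below expresses hits per million predicted proteins; this particular normalisation is an assumption made here for illustration and may differ from the measure used in the supplementary figures.

```python
# Counts quoted in the text for the three metagenomic datasets.
ALKB_HITS = {"terrestrial": 301, "freshwater": 144, "marine": 524}
PREDICTED_PROTEINS = {
    "terrestrial": 20_162_506,
    "freshwater": 4_999_959,
    "marine": 9_254_226,
}


def hits_per_million(hits: int, total_proteins: int) -> float:
    """Gene hits normalised to one million predicted proteins."""
    if total_proteins <= 0:
        raise ValueError("total_proteins must be positive")
    return 1_000_000 * hits / total_proteins


alkb_rates = {
    habitat: hits_per_million(ALKB_HITS[habitat], PREDICTED_PROTEINS[habitat])
    for habitat in ALKB_HITS
}

for habitat, rate in alkb_rates.items():
    print(f"{habitat}: {rate:.1f} alkB hits per million predicted proteins")
```

On this measure the marine dataset has the highest alkB density and the terrestrial dataset the lowest, even though the terrestrial dataset has more raw hits than the freshwater one.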

Since it is hard to infer the taxonomy of the alkB genes from the above phylogenetic trees based on the
bacterial genomes and 28 reference sequences, taxonomic analysis was conducted by comparing the deduced
proteins of all alkB genes against the NR database in GenBank using a BLASTP search and the MEGAN
program with the lowest common ancestor algorithm (Fig. S9). Although it is difficult to assign unknown
metagenome-derived AH genes to taxa accurately without examining flanking gene content, because of HGT,
here we aim to provide a picture of the potential taxonomic affiliations of metagenome-derived AH genes.
Our binning approach can be applied to facilitate broad comparisons among different samples.
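The idea behind common-ancestor binning can be sketched in a few lines: a query is assigned to the deepest taxon shared by the lineages of all its database hits. The code below is a simplified illustration only; it does not reproduce MEGAN, which applies additional hit filtering before assignment, and the example lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three BLASTP hits for one metagenomic AlkB sequence.
hit_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonas"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Marinobacter"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria", "Burkholderia"],
]

assignment = lowest_common_ancestor(hit_lineages)
print(" > ".join(assignment))
```

This shows why many sequences can be placed at the phylum level but not at the genus level: a single divergent hit moves the assignment up the taxonomy.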

The results showed that only 100 (33.3%), 25 (17.4%) and 52 (9.9%) of the 301, 144, and 524 alkB sequences
from the terrestrial, freshwater and marine metagenomes, respectively, encoded proteins that were highly
similar (>75%) to proteins deposited in the NR database. In contrast, 103 (34.2%), 34 (23.6%), and 290
(55.3%) sequences were <50% identical to the deposited proteins, respectively (Fig. S10). In addition, only
177 (58.8%), 89 (61.8%), and 363 (69.3%) of the alkB sequences in the terrestrial, freshwater and marine
metagenomes could be assigned to a phylum. All these results suggest a vast number of novel alkB genes in
the different environments.

Figure 2 | Phylogenetic distribution of alkB genes based on amino acid sequence analysis. A: Major clusters
of alkB genes. B: Comparison of alkB gene and 16S rRNA gene phylogenies; alkB (left), 16S rRNA (right). All
the sequences were aligned and analyzed by the neighbor-joining method using ARB. The trees were
bootstrapped with 1,000 replicates. Bootstrap values of >50% are indicated at the respective nodes. The
scale bar indicates the percentage of sequence divergence.

Figure 3 | Phylogenetic distribution of CYP153 genes based on amino acid sequence analysis. A: Major
clusters of CYP153 genes. B: Comparison of CYP153 gene (left) and 16S rRNA gene (right) phylogenies. All
the sequences were aligned and analyzed by the neighbor-joining method using ARB. The trees were
bootstrapped with 1,000 replicates. Bootstrap values of >50% are indicated at the respective nodes. The
scale bar indicates the percentage of sequence divergence.

Table 2 | Location and G+C contents of CYP153 genes found in Actinobacteria genomes

| Accession No. | Source | Location (a) | GIs (b) | G+C content of CYP153 gene | G+C content of genome | Ratio (c) |
|---|---|---|---|---|---|---|
| [illegible] | Amycolicicoccus subflavus DQS3-9A1 | [illegible] | [illegible] | [illegible] | 62.2% | 1.07 |
| [illegible] | Amycolicicoccus subflavus DQS3-9A1 | C | + | [illegible] | 62.2% | 1.11 |
| [illegible] | Dietzia cinnamea P4 (D) | NA | NA | [illegible] | [illegible] | [illegible] |
| [illegible] | Mycobacterium abscessus ATCC 19977 | C | [illegible] | [illegible] | [illegible] | [illegible] |
| [illegible] | Mycobacterium chubuense NBB4 | P | [illegible] | [illegible] | [illegible] | [illegible] |
| [illegible] | Mycobacterium sp. KMS | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| [illegible] | Mycobacterium rhodesiae NBB3 | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] |
| ZP_08199554.1 | Nocardioidaceae bacterium Broad-1 (D) | NA | NA | [illegible] | [illegible] | [illegible] |
| [illegible] | Nocardia cyriacigeorgica GUH-2 | C | + | [illegible] | [illegible] | [illegible] |
| [illegible] | Patulibacter sp. I11 (D) | NA | NA | 64.6% | 74.1% | [illegible] |
| [illegible] | Rhodococcus erythropolis PR4 | P | [illegible] | [illegible] | [illegible] | [illegible] |
| [illegible] | [illegible] | NA | NA | 58.9% | 62.4% | 1.07 |

(a) Location of CYP153 genes. C, genes located in chromosome; P, genes located in plasmid; NA, no data available.
(b) Genes found in predicted genomic islands (GIs). +, genes located in GI; -, genes not located in GI; NA, no data available.
(c) CYP153 gene versus average genome G+C content.
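The novelty estimate rests on binning each metagenomic sequence by the identity of its best database hit. A minimal sketch of that binning is given below, using the >75% and <50% cut-offs from the text; the list of identities is invented for illustration, and the treatment of values between the two cut-offs as an "intermediate" bin is an assumption.

```python
from typing import Dict, Iterable


def bin_by_identity(
    identities: Iterable[float], high: float = 75.0, low: float = 50.0
) -> Dict[str, int]:
    """Count best-hit identities that are >high, <low, or in between."""
    counts = {"high": 0, "intermediate": 0, "low": 0}
    for pid in identities:
        if pid > high:
            counts["high"] += 1
        elif pid < low:
            counts["low"] += 1
        else:
            counts["intermediate"] += 1
    return counts


# Hypothetical best-hit identities (%) for six metagenomic AlkB sequences.
toy_bins = bin_by_identity([92.1, 75.0, 49.9, 63.4, 81.0, 30.2])
print(toy_bins)

# Applied to the marine counts in the text: 52 high and 290 low out of 524
# leaves the remainder in the intermediate bin.
marine_intermediate = 524 - 52 - 290
print(marine_intermediate)
```

Note that a value of exactly 75.0% falls into the intermediate bin here because the text uses strict inequalities.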

Among the AlkB sequences assigned to a phylum, 69 (39.0%), 61 (68.5%), and 220 (60.6%) sequences were
from Proteobacteria, in comparison with Proteobacteria occupying 34.9%, 56.9% and 51.6% of the total
metagenomic communities in the terrestrial, freshwater and marine metagenomes, respectively (Fig. 4a and
Fig. S7). In addition, 22 (25.8%) and 130 (36.9%) AlkB sequences in the freshwater and marine
metagenomes were found to be related to Bacteroidetes, although the relative abundances of Bacteroidetes
were only 14.2% and 13.4% in the two environments, respectively (Fig. 4a and Fig. S7). In contrast, 97
(54.8%) and 5 (5.6%) sequences were related to Actinobacteria, in comparison with Actinobacteria
occupying 19.5% and 9.6% of the total metagenomic communities in the terrestrial and freshwater
metagenomes, respectively (Fig. 4a and Fig. S7). These results suggest that AlkB sequences from
Actinobacteria were enriched in terrestrial metagenomes, whereas those from Bacteroidetes were enriched
in aquatic environments. No AlkB sequence from Actinobacteria was found in marine metagenomes.
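The enrichment argument compares a phylum's share of the assigned AlkB sequences with its share of the whole community. One informal way to express this is a ratio of the two percentages, sketched below. The ratio is introduced here for illustration and is not a statistic reported by the authors; the input percentages are those quoted in the text.

```python
def enrichment_ratio(gene_share_pct: float, community_share_pct: float) -> float:
    """Share of assigned gene sequences divided by share of the community."""
    if community_share_pct <= 0:
        raise ValueError("community share must be positive")
    return gene_share_pct / community_share_pct


# (phylum, habitat): (share of phylum-assigned AlkB sequences, community share)
observations = {
    ("Actinobacteria", "terrestrial"): (54.8, 19.5),
    ("Bacteroidetes", "marine"): (36.9, 13.4),
    ("Proteobacteria", "terrestrial"): (39.0, 34.9),
}

ratios = {key: enrichment_ratio(*shares) for key, shares in observations.items()}

for (phylum, habitat), ratio in ratios.items():
    print(f"{phylum} ({habitat}): {ratio:.2f}x")
```

Ratios well above 1, as for terrestrial Actinobacteria and marine Bacteroidetes, correspond to the enrichment described in the text, whereas terrestrial Proteobacteria stay close to their community share.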

Furthermore, only about 22.6%, 15.3%, and 30.9% of the total 301, 144, and 524 AlkB sequences in the
terrestrial, freshwater and marine metagenomes, respectively, could be assigned to a genus, including
Conexibacter, Pseudomonas, Mycobacterium, Acidiphilium, and Polaribacter (Fig. S9), many of which are not
known as alkane degraders. Among them, sequences related to Polaribacter were found in all three habitats.
Sequences related to Conexibacter, Mycobacterium, Hyphomicrobium and Rhodococcus were found only in
terrestrial metagenomes, and those related to Methylophaga, Burkholderia, Methylomicrobium, Leptospira,
Kordia, Haliscomenobacter, Roseobacter, Ahrensia, and Ralstonia were found only in marine metagenomes,
whereas Marivirga was unique to the freshwater metagenomes (Fig. S11).

CYP153 genes in metagenomes. Using analyses similar to those applied for the alkB genes, 585, 43, and 332
CYP153 homologous genes were found in the terrestrial, freshwater and marine metagenomic datasets,
respectively. The relative abundance of candidate CYP153 genes in freshwater was much lower than in
terrestrial and marine metagenomes (Fig. S12). The phylogenetic tree based on CYP153 sequences showed
that the CYP153 sequences were only from Proteobacteria and Actinobacteria. Among them, most terrestrial
sequences were in Clusters III and IV, with sequences from Alphaproteobacteria. The remaining ones were
distributed in Cluster II, related to Gammaproteobacteria (Fig. S13a). Freshwater CYP153 sequences were
mainly distributed in Clusters III and IV, with those related to Alphaproteobacteria (Fig. S13b). Most marine
CYP153 sequences were in Cluster II, affiliated with sequences from Gammaproteobacteria. The remaining
sequences were mainly in Cluster III, with sequences from Alphaproteobacteria (Fig. S13c). Analysis of
deduced protein sequences of CYP153 genes by BLASTP showed that only about 36.9%, 30.2%, and 11.1%
of the total 585, 43, and 332 candidate CYP153 genes in the terrestrial, freshwater and marine metagenomic
datasets, respectively, were >75% identical at the amino acid level to the genes deposited in the NR
database (Fig. S14). About 80.5%, 81.4%, and 75% of the total CYP153 sequences in the respective
terrestrial, freshwater and marine metagenomes could be assigned to the phylum level. Among the
sequences assigned to a phylum, CYP153 genes related to Alphaproteobacteria were well distributed in all
three environments. In contrast, CYP153 sequences related to Gammaproteobacteria were most enriched in
marine environments (Fig. 4b). Only about 24.1%, 11.6%, and 10.8% of the total 585, 43, and 332 CYP153
sequences could be assigned to the genus level (Fig. S15), including Bradyrhizobium, Caulobacter,
Phenylobacterium, and Parvibaculum belonging to Alphaproteobacteria in terrestrial metagenomes,
Sphingopyxis in freshwater, and Parvibaculum, Hyphomonas, Bradyrhizobium, Caulobacter, and
Sphingopyxis belonging to Alphaproteobacteria in marine metagenomes (Fig. S15).

Figure 4 | Taxonomic distribution of alkane hydroxylases in freshwater, marine and terrestrial habitats.
A: Taxonomic distribution of alkB genes. B: Taxonomic distribution of CYP153 genes.

Discussion 

Unexpectedly diverse alkB and CYP153 genes were detected in different environments and in many
bacterial genomes, including those of genera without proven alkane hydroxylation functions, such as
Conexibacter, Acidiphilium, Methylibium, and Leptospira (Table S5). Some of these bacteria, including
Acidiphilium, Parvularcula, Limnobacter, and Glaciecola, have been found in oil-contaminated environments,
whereas many other bacteria, including those with proven alkane degradation abilities, were not originally
isolated from petroleum-contaminated environments. Although natural environments are not always
petroleum-contaminated, alkanes are in fact observed throughout nature, albeit at low concentrations. They
can be produced from fatty acid metabolites of plants, insects, and microorganisms. For example, to protect
against water loss and pathogen infection, plants produce cuticular waxes at the epidermal cells, which
typically constitute 20-60% of the cuticle mass and are complex mixtures of straight-chain C20-C60
aliphatics. Insects produce pheromones with a hydrocarbon backbone. Alkanes have also been reported to be
synthesized in diverse microorganisms, including cyanobacteria, which were reported to produce high
proportions of heptadecane. Therefore, alkanes are always observed in natural habitats dominated by
cyanobacteria.

The ubiquitous presence of alkanes in nature could have played important roles in the evolution of
hydroxylase genes in different environments, resulting in the detection of so many alkB and CYP153 genes
in both the microbial genomes and the various metagenomes. In addition, the different alkane availabilities,
either in amount or in molecular structure, along with different microbial communities in environments
could have enriched different alkB and CYP153 genes in different habitats. For example, the alkB genes were
much less diverse in freshwater than in both marine and terrestrial habitats, which might be a consequence
of the much lower alkane availability in freshwater environments. Similarly, the higher abundance of
Actinobacteria in terrestrial than in marine environments could be one of the reasons why more alkB genes
related to Actinobacteria were found in terrestrial metagenomes (Fig. S7). Moreover, soil microbial
communities are shaped by a complex interaction of soil variables, such as salinity, pH, total carbon, and
availability of oxygen. Among these variables, the presence of total petroleum hydrocarbons and the
different bioavailability of alkanes had a significant effect on the structure of alkane-degrading
communities. The enrichment of Bacteroidetes alkB genes in freshwater, as well as the detection of different
alkane degraders in the metagenomes, is likely attributable to the different availabilities of alkanes and the
presence of different bacterial communities. However, further research is needed to investigate the exact
reasons. Remarkably, different distributions of alkB and CYP153 genes were also found between the
microbial genomes and the metagenomes. For example, although a large number of alkB sequences were
found in Acinetobacter genomes, no sequence related to Acinetobacter was found in any of the
metagenomes. These inconsistencies could result from the difference between culture-dependent and
culture-independent analytical approaches. A recent study on the effects of different methods used in the
analysis of soil microbial communities showed unexpected accessibility of the rare biosphere by culturing.
Soil bacteria captured by culturing, such as Pseudomonas, Rhodococcus, Arthrobacter and Flavobacterium,
which were abundant among the cultured organisms, were in very low abundance or absent in the
culture-independent community. In contrast, some bacteria abundant in the culture-independent community
were not cultured. This also suggests that many bacteria harboring alkB and CYP153 genes have not been
isolated yet (Figs. S9 and S15). However, it can hardly be concluded that the commonly isolated
alkane-degrading strains that are less dominant in the various environments cannot play important roles in
alkane degradation, and vice versa, because low-abundance bacteria could burst forth under certain
environmental stresses, such as an oil spill, and ultimately become important.

Although members of Bacillus and Geobacillus are often detected in oil-related environments, and a number
of Bacillus and Geobacillus strains have been reported to utilize long-chain n-alkanes as the sole carbon and
energy source, neither the alkB gene nor the CYP153 gene was found in the 1,077 sequenced Firmicutes
genomes, including 139 Bacillus, 10 Geobacillus and one Thermobacillus genomes. Moreover, no alkB or
CYP153 homologous genes from Firmicutes were found in any of the three metagenomic datasets. Although
alkB genes have been reported in alkane degraders belonging to Geobacillus, the alkB gene fragments had
remarkably high amino acid sequence similarities with those in Rhodococcus, which suggested that alkB
genes in Geobacillus were obtained via HGT from Rhodococcus or other Actinobacteria. It therefore seems
that alkB and CYP153 are not the key genes for alkane degradation in Bacillus and Geobacillus. Whether the
alkane-degrading Firmicutes have other AHs needs further research. At least one novel soluble long-chain
alkane monooxygenase (LadA) has been found in Geobacillus thermodenitrificans NG80-2, catalyzing the
first alkane hydroxylation reaction in the alkane-degrading pathway.

The generation of novel functions is critical for microorganisms
to respond to environmental and evolutionary challenges.
Gene duplication, HGT, and gene fusion/fission are common
evolutionary processes that generate novel genes or functions for
rapid adaptation'""'"'. Among the genes, rRNA genes are thought to 
be the most conserved, and the rRNA-based phylogeny is considered 
to be robust and consistent with the genome phylogeny"*^. In general, 
highly similar clustering of Actinobacteria and Betaproteobacteria
was found between the 16S rRNA and AlkB trees, as was the
ordering of Gammaproteobacteria between the 16S rRNA and
CYP153 trees. These results indicate that the alkB/CYP153 genes might
have arisen before the speciation events and been inherited
from their ancestors. Major inconsistencies between the 16S rRNA
and AlkB trees were found for some members of Bacteroidetes,
Gammaproteobacteria and Alphaproteobacteria. In contrast, major
inconsistencies between the 16S rRNA and CYP153 trees were found
for all members of Actinobacteria, which formed the outgroup
of Proteobacteria in the 16S rRNA tree but clustered with some
members of Gammaproteobacteria in the CYP153 tree (Fig. 3).
The inconsistencies suggest that these alkB/CYP153
genes originated through different processes. 

Paralogs are homologs that arise through gene duplication events. 
They usually form two groups sharing identical phylogenies in the 
phylogenetic tree, and are connected by a branch that indicates their 
last common ancestor'"'. Based on these definitions, both alkB and 
CYP153 paralogs were found in some genomes containing multiple 
AH genes. For example, alkB paralogs were found in Pseudomonas 
aeruginosa (Cluster VI) and Alphaproteobacteria (Cluster VII) (Fig. 
S6). AlkB sequences from Alphaproteobacteria in Cluster VII were 
clustered into two groups. Except for the genomes of Pelagibaca 
bermudensis HTCC2601, Octadecabacter arcticus 238 and 
Rhodobacter capsulatus SB 1003, genomes in group 2 always had 
more than two alkB genes: one was in group 2 and the others in 
group 1. Although they fell into two groups, these genes shared similar G+C
content with their host genomes, clustered together, and were distant
from the clusters of alkB genes from other bacterial classes.
Moreover, genomes from Pseudomonas aeruginosa strains contained
two copies of the alkB gene. They clustered into two groups that were
distant from other bacterial taxa and shared similar G+C contents
with their host genomes, indicating that they might be paralogs
separated by gene duplication. Similarly, CYP153 genes in
Cluster IV were possibly paralogous genes because of the deep
bifurcations and long branch lengths (Fig. 3b), which indicated a more
rapid evolutionary rate than that of orthologous genes. The paralogous
genes may have different abilities and functions. For example,
alkW1 and alkW2 are two paralogous genes encoding AlkB-rubredoxin
fusion proteins found in Dietzia sp. DQ12-45-1b. Functional
research showed that alkW1 could hydroxylate n-alkanes ranging
from C14 to C32, whereas alkW2 was not expressed". 

HGT is a potent evolutionary force in prokaryotes. Genes acquired 
by HGT can be predicted by comparing the G+C
content, codon usage and phylogenies of the candidate genes
with those of the whole genomes, as well as by analyzing the flanking mobile
genetic elements'*'^'™. It is common for catabolic genes to undergo
genetic rearrangements, such as insertions, deletions, duplications
and inversions, which are attributable to the presence of elements that
can mobilise the catabolic genes^'. One example is
the Pseudomonas putida GPo1 alkB gene, which is located on the OCT
plasmid^^''"''. The G+C content of the GPo1 alkB gene is much lower than
that of the host strain and the OCT plasmid, and the gene is flanked by the
insertion sequence ISPpu4, constituting a class 1 transposon,
suggesting that this gene is part of a mobile element'"'' and was obtained from
some closely related Alcanivorax strains^"*. The possible HGT of alkB
genes may explain why several alkB genes from Pseudomonas are
distributed in several different clusters, as has also been proposed
before^l. Interestingly, almost all the Actinobacteria genomes containing
CYP153 genes had alkB genes, suggesting a potential link
between the CYP153 and alkB genes in the Actinobacteria. It is 
reasonable that bacteria need not only AHs, but also enzymes for 
sensing, taking up, and emulsifying alkanes to degrade them. 
Successful HGT of both the alkB and CYP153 genes could therefore 
more easily occur between alkane degraders which originally have 
the entire alkane degradation-related systems^'. 

Gene fusion, in which two previously independent genes fuse into a
single contiguous open reading frame, is thought to be an important
source of evolutionary novelty. It has contributed significantly to the
rapid evolution of novel biological functions^'. Although the substrate
chain length of AlkB-homologous AHs is often less than C16,
recent research has shown that the AlkB-rubredoxin fusion could
extend the spectrum of long-chain n-alkanes degraded. The fusion of
the two genes could be beneficial either for electron transfer or for
substrate-enzyme binding". In general, bacterial P450s need
ferredoxin and ferredoxin reductase for electron transfer.
Self-sufficient P450s with other reductase domains were characterized by
their remarkable enzymatic activities^*, supporting the idea that gene
fusion could be beneficial for the evolution of novel functions. In
this study, the fusion of AlkB, ferredoxin and ferredoxin reductase
was detected for the first time in Leptospira, Limnobacter and
Polaromonas. Although rubredoxin and rubredoxin reductase could
be replaced in vitro by ferredoxin and ferredoxin reductase,
respectively"'", further study is needed to address whether ferredoxin and
ferredoxin reductase could function as rubredoxin and rubredoxin
reductase in vivo, or vice versa. 

Although rapid evolutionary events such as gene duplication and
HGT could explain the inconsistencies between the AH and 16S rRNA
phylogenies, some inconsistent branches could not be explained, at least
by recent evolutionary events. For example, AlkB sequences related
to Bacteroidetes were embedded in the Proteobacteria and clustered
with Gammaproteobacteria in the AlkB-based phylogenetic tree
(Cluster V). Genomic island (GI) prediction and analysis of the G+C content did
not reveal an obvious recent HGT event of alkB genes in either
Bacteroidetes or Gammaproteobacteria. Nor were paralogs
found in these cases. AlkB sequences from Pseudomonas fluorescens
and Pseudomonas aeruginosa separated into two distinct clusters, but
no obvious evidence was found to support recent rapid evolutionary
events such as gene duplication or HGT in these strains. This
suggests that alkB genes in these strains might have arisen after the
speciation events. However, the detailed mechanisms need further
study. 

In summary, hundreds of putative alkB and CYP153 genes, most
of which are novel and have low identity to known genes, were
retrieved by mining the released microbial genome and metagenome
databases. They were found in Proteobacteria, Actinobacteria,
Bacteroidetes, Spirochaetes and Planctomycetes, but not in archaea.
Rapid evolutionary events such as HGT, gene duplication, and gene
fusion likely contributed to the diversity of both the alkB and
CYP153 genes. Moreover, alkB and CYP153 genes, many of them
unknown or less common, were distributed differently in
the terrestrial, freshwater and marine environments, suggesting their
potential contributions to alkane metabolism in nature. Although the
functions and evolution of AH genes and the mechanisms by which
organisms and genes are selected in different habitats need
further research, our work provides an overall profile of the alkB
and CYP153 gene distributions in microbes and environments, which
can help to understand gene and microbial functions in
alkane degradation in different environments. 

Methods 

Generation of protein sequence database. All microbial genomic data available
as of September 2012, including their predicted open reading frames, were
downloaded from the NCBI genome database, which consisted of 
2,069 complete and 1,910 draft microbial genomes representing 784 different genera. 
Among them, Archaea had 134 complete and 18 draft genomes, representing 72 
genera. A total of 6,534,587 and 8,864,443 protein sequences were obtained from all 
the complete and draft genomes, respectively. Finally, a microbial proteomic database 
containing 15,399,030 protein sequences was built. 



Search for alkane hydroxylases in the microbial genomes. First, protein sequences 
of 28 AlkB and 18 CYP153 functionally characterized enzymes were selected as 
references. They were aligned using the MUSCLE program^^ and the alignment was 
manually adjusted. The aligned sequences were then used to construct a first profile 
Hidden Markov Model (pHMM)^^ for AlkB and CYP153, respectively, by using 
HMMER3 software package^*. 

The first AlkB and CYP153 pHMMs were then used to search the AlkB and 
CYP153 sequences in the microbial proteomic database with 15,399,030 protein 
sequences by using the HMMER3 software. The resulting positive hits of both AlkB 
and CYP153 were then aligned with the MUSCLE program and manually filtered 
with the following criteria: for AlkB proteins, hit sequences without three histidine 
boxes and HYG motif were excluded*'^^; for CYP153 proteins, the hit sequences were 
compared against the bacterial cytochrome P450 database
(http://drnelson.uthsc.edu/CytochromeP450.html), and only those whose best hits were
CYP153 family members with >40% identity were selected^^. The new CYP153 
protein sequences were added to update the bacterial P450 database. Finally, the 
obtained AlkB and CYP153 protein sequences with their respective reference 
sequences were aligned and applied to build the second pHMMs, respectively. The 
second pHMMs were used to again search against the microbial proteomic database 
for new AlkB and CYP153 proteins as described above, repeating in the same manner 
until no new sequence was found. The obtained sequences were then used for the 
following analyses, including phylogenetic and evolutionary analyses. To identify the 
possible conserved domains, the obtained proteins were analyzed using the Simple 
Modular Architecture Research Tool (SMART, http://smart.embl-heidelberg.de)^^. 
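
The iterative search described above (build a model, search, filter, rebuild, and repeat until no new sequence is found) can be sketched as a simple fixed-point loop. This is a minimal illustration, not the authors' pipeline: the `search` callback is a hypothetical stand-in for the pHMM build and HMMER3 search steps, and the motif patterns in `ALKB_MOTIFS` are assumed, simplified forms, since the text does not list the exact patterns used for the histidine boxes and the HYG motif.

```python
import re
from typing import Callable, Iterable, Set

# Assumed, simplified motif patterns for illustration only; the text does not
# give the exact patterns used for the three histidine boxes and the HYG motif.
ALKB_MOTIFS = [
    re.compile(r"HE[LM].HK"),    # histidine box 1 (assumed form)
    re.compile(r"EH..GHH"),      # histidine box 2 (assumed form)
    re.compile(r"LQRH[SA]DHH"),  # histidine box 3 (assumed form)
    re.compile(r"HYG"),          # HYG motif (minimal form)
]


def has_alkb_motifs(seq: str) -> bool:
    """Return True only if every assumed motif occurs somewhere in the sequence."""
    upper = seq.upper()
    return all(motif.search(upper) for motif in ALKB_MOTIFS)


def iterative_search(
    seeds: Set[str],
    search: Callable[[Set[str]], Iterable[str]],
    accept: Callable[[str], bool],
    max_rounds: int = 20,
) -> Set[str]:
    """Repeat 'model from current members -> search -> filter' until no new hit.

    `search` is a hypothetical callback: given the current member IDs it should
    rebuild the profile and return the IDs of all positive hits in the database.
    `accept` applies the filtering criteria to a single hit ID.
    """
    members = set(seeds)
    for _ in range(max_rounds):
        hits = {h for h in search(members) if accept(h)}
        new = hits - members
        if not new:  # converged: the last round found nothing new
            break
        members |= new
    return members
```

In a real run, `search` would wrap the alignment, profile construction and database search, and `accept` would look up the hit's sequence and apply `has_alkb_motifs` (for AlkB) or the identity cut-off (for CYP153). The `max_rounds` cap is an added safeguard, not part of the described procedure.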

The GenBank accession numbers or Gene IDs of all the AH genes identified in this
study are shown in Table S7 and Table S8. 

Phylogenetic analysis. The 16S rRNA genes were extracted from the genome 
database, as well as from the Ribosomal Database Project^^ and Silva database^^. They, 
along with the AH sequences obtained above, were used to construct the phylogenetic 
trees using ARB™ by neighbor-joining algorithm^\ The stability of tree topology was 
evaluated by bootstrap resampling with a total of 1,000 replicates in all cases. The 16S 
rRNA gene from Methanobacterium curvum, XylM protein from Pseudomonas 
putida mt-2 and P450cam protein were selected as the outgroups for building the 16S 
rRNA, AlkB and CYP153 phylogenetic trees, respectively. 
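
The bootstrap resampling step mentioned above can be illustrated with a small sketch. It only shows how alignment columns are resampled with replacement; it does not rebuild neighbor-joining trees as ARB does. The function names and the pairwise "closer pair" statistic are hypothetical simplifications used as a stand-in for full topology support.

```python
import random
from typing import Dict


def bootstrap_alignment(aln: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Draw one pseudo-replicate by sampling alignment columns with replacement."""
    n_cols = len(next(iter(aln.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def support_for_closer_pair(
    aln: Dict[str, str], a: str, b: str, c: str,
    replicates: int = 1000, seed: int = 0,
) -> float:
    """Fraction of replicates in which `a` is closer to `b` than to `c`.

    A simplified proxy for bootstrap support of an (a, b) grouping; a real
    analysis would rebuild the whole tree for every replicate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(replicates):
        rep = bootstrap_alignment(aln, rng)
        if p_distance(rep[a], rep[b]) < p_distance(rep[a], rep[c]):
            wins += 1
    return wins / replicates
```

The default of 1,000 replicates mirrors the number stated in the text; the fixed seed is only there to make the sketch reproducible.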

Analysis of horizontal gene transfer events. The AlkB and CYP153 phylogenies 
were compared against the 16S rRNA gene-based trees. The inconsistent species in 
the two trees were subjected to the analysis of genomic islands (GIs) using 
IslandViewer (http://www.pathogenomics.sfu.ca/islandviewer)^^. The G + C contents 
of these inconsistent sequences were also calculated to compare with their host 
genomes. Genes located in predicted GIs or with considerable G + C content
differences were postulated to have high HGT potential. 
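
The G + C comparison described above can be sketched as follows. The text does not state what counts as a "considerable" difference, so the `min_gc_diff` threshold of 0.05 is purely an assumption for illustration, and the genomic-island flag is taken as an input rather than computed.

```python
def gc_content(seq: str) -> float:
    """Return the G + C fraction of a nucleotide sequence, ignoring non-ACGT symbols."""
    upper = seq.upper()
    counted = sum(upper.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (upper.count("G") + upper.count("C")) / counted


def is_hgt_candidate(
    gene: str,
    genome: str,
    in_genomic_island: bool,
    min_gc_diff: float = 0.05,  # assumed threshold; not given in the text
) -> bool:
    """Flag a gene if it lies in a predicted GI or its G + C differs markedly from the host."""
    diff = abs(gc_content(gene) - gc_content(genome))
    return in_genomic_island or diff >= min_gc_diff
```

For example, `is_hgt_candidate(gene_seq, genome_seq, in_genomic_island=False)` flags a gene solely on its G + C deviation, whereas passing `True` flags it regardless of composition, matching the "either criterion" wording of the paragraph.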

Search for alkane hydroxylases in different metagenomes. All 137 assembled
metagenome datasets available as of September 2012, including their
open reading frames, were downloaded from the IMG/M database^^; 42, 35 and 60 of
them were from terrestrial, freshwater and marine environments, respectively. The
phylogenetic information of each sample was also downloaded from the "Radial Tree
Distribution" in IMG/M and the average microbial compositions were calculated. For
example, to calculate the average microbial composition of the terrestrial
metagenomes, the numbers of hits for each taxon (phylum level) in each
terrestrial metagenome were summed, and then divided by the total number of
hits in all terrestrial metagenomes. In total, 20,162,506, 4,999,959 and 9,254,226
proteins were obtained from the selected terrestrial, freshwater and
marine metagenomes, respectively. Similarly, the above-generated AlkB and CYP153
pHMMs were used to search against the metagenome protein databases using the
HMMER3 software. The positive hits were aligned and manually filtered as described
above. The obtained sequences were compared against the NCBI NR database using
BLASTP and the BLAST results were imported into MEGAN4^'' for taxonomic
analysis using the lowest common ancestor-based algorithm. Phylogenetic trees were 
built based on the sequences from metagenomic data together with those from 
microbial genomes using ARB, as described above. 
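
The average microbial composition calculation described above (per-phylum hits summed over all metagenomes of one environment, divided by the total hits in that environment) can be written as a short sketch. The function name and the input layout, one dictionary of phylum-level hit counts per metagenome, are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def average_composition(samples: List[Dict[str, int]]) -> Dict[str, float]:
    """Pool phylum-level hit counts over all samples and normalise by the grand total."""
    totals: Counter = Counter()
    for hits in samples:
        totals.update(hits)  # adds the counts for each phylum
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no hits in any sample")
    return {taxon: count / grand_total for taxon, count in totals.items()}


# Toy example with two made-up terrestrial metagenomes.
toy_terrestrial = [
    {"Proteobacteria": 60, "Actinobacteria": 40},
    {"Proteobacteria": 20, "Firmicutes": 80},
]
toy_result = average_composition(toy_terrestrial)
```

Because counts are pooled before dividing, metagenomes with more hits weigh more heavily in the average, which follows the calculation as worded in the text; averaging per-sample fractions instead would give each metagenome equal weight.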